To visualize & analyze card ownership, potential relationships and suspicious activities with the employee disappearance incident
Continuing the study from Part 1.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
If one consumption time fall with one car stop period, we believe it’s possible one correspondence. There might be several different purchase in one car stop period, but we can infer that the most common pair within one group is the most likely true pair.
The credit card data contain specific time, but loyalty card data doesn’t. Thus, We will find the relationship between credit cards and loyalty cards. After that, we will match the credit card and car stop. The owners of loyalty card can be inferred from the relationship between credit card and car stop and the relationship between credit cards and loyalty cards.
We full join the two card table by matching day, location and price. It’s seldom that two different consumption will have the same values in the three features. Then, we use group_by() to find all pairs of two cards and count the consumption frequency of the pair.
There will be some rows which can’t match. This might be someone used only one of the two cards or got cashback. We filter out these situations, where the card pair contains null value.
# # make a full join
card_correspond_count <- full_join(cc, loyalty,
by = c("day", "location", "price")) %>%
# calculate frequency
group_by(last4ccnum, loyaltynum) %>%
summarise(count = n()) %>%
# filter out mismatch
drop_na()
# convert 'last4ccnum' into string to plot
card_correspond_count$last4ccnum <- as.character(card_correspond_count$last4ccnum)
Most pairs are one-on-one. It’s confident to conclude there pairs are true (credit and loyalty card in each pair belong to one owner).
card_correspond_count_one2one <- card_correspond_count %>%
filter((n_distinct(last4ccnum)==1 & n_distinct(loyaltynum)==1))
knitr::kable(card_correspond_count_one2one,
caption = "One-on-one Matched Pairs") %>%
kableExtra::kable_paper("hover", full_width = F) %>%
kableExtra::scroll_box(height = "300px")
| last4ccnum | loyaltynum | count |
|---|---|---|
| 1310 | L8012 | 21 |
| 1321 | L4149 | 22 |
| 1415 | L7783 | 24 |
| 1874 | L4424 | 25 |
| 1877 | L3014 | 18 |
| 2142 | L9637 | 25 |
| 2276 | L3317 | 10 |
| 2418 | L9018 | 20 |
| 2463 | L6886 | 23 |
| 2540 | L5947 | 20 |
| 2681 | L1107 | 20 |
| 3484 | L2490 | 24 |
| 3492 | L7814 | 22 |
| 3506 | L7761 | 6 |
| 3547 | L9362 | 14 |
| 3853 | L1485 | 22 |
| 4434 | L2169 | 26 |
| 4530 | L8477 | 10 |
| 5010 | L2459 | 5 |
| 5407 | L4034 | 20 |
| 6691 | L6267 | 20 |
| 6816 | L8148 | 20 |
| 6895 | L3366 | 21 |
| 6899 | L6267 | 23 |
| 6901 | L9363 | 28 |
| 7108 | L6544 | 16 |
| 7117 | L6417 | 28 |
| 7253 | L1682 | 24 |
| 7354 | L9254 | 21 |
| 7384 | L3800 | 26 |
| 7688 | L4164 | 22 |
| 7792 | L5756 | 20 |
| 7819 | L5259 | 27 |
| 8129 | L8328 | 21 |
| 8156 | L5224 | 22 |
| 8202 | L2343 | 25 |
| 8411 | L6110 | 27 |
| 8642 | L2769 | 12 |
| 9152 | L5485 | 2 |
| 9220 | L4063 | 9 |
| 9241 | L3288 | 13 |
| 9405 | L3259 | 21 |
| 9551 | L5777 | 12 |
| 9614 | L5924 | 2 |
| 9617 | L5553 | 26 |
| 9635 | L3191 | 24 |
| 9683 | L7291 | 18 |
| 9735 | L9633 | 16 |
For those who have matched but not one-on-one, we can plot their parallel graph.
# filter multiple match
card_correspond_count_others <- card_correspond_count %>%
filter(n_distinct(last4ccnum)>1 | n_distinct(loyaltynum)>1)
card_correspond_count_others_plot <- card_correspond_count_others %>%
gather_set_data(1:2) %>% # <- ggforce helper function
arrange(x,last4ccnum,desc(loyaltynum))
# plot
ggplot(card_correspond_count_others_plot,
aes(x = x, id = id, split = y, value = count)) +
geom_parallel_sets(aes(fill = last4ccnum), alpha = 0.7,
axis.width = 0.2, n=100, strength = 0.5) +
geom_parallel_sets_axes(axis.width = 0.25, fill = "gray95",
color = "gray80", size = 0.15) +
geom_parallel_sets_labels(colour = 'gray35', size = 4.5,
angle = 0, fontface="bold") +
theme_minimal() +
theme(
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size = 12, face = "bold"),
axis.title.x = element_blank()
)
Figure 1: Multiple Matched Pairs
We can infer that the pair with a wider line is the true pair. For example, credit card 4795 (Blue line) matches both L2070 and L8566 loyalty cards, but we can make sure that L8566 is the true pair. Because the the consumption count by L8566, which is displayed as the line width, is much higher than L2070.
knitr::kable(card_correspond_count_others,
caption = "Other Matched Pairs") %>%
kableExtra::kable_paper("hover", full_width = F) %>%
kableExtra::scroll_box(height = "300px")
| last4ccnum | loyaltynum | count |
|---|---|---|
| 1286 | L3288 | 15 |
| 1286 | L3572 | 13 |
| 4795 | L2070 | 1 |
| 4795 | L8566 | 25 |
| 4948 | L3295 | 1 |
| 4948 | L9406 | 22 |
| 5368 | L2247 | 24 |
| 5368 | L6119 | 1 |
| 5921 | L3295 | 12 |
| 5921 | L9406 | 1 |
| 7889 | L2247 | 1 |
| 7889 | L6119 | 20 |
| 8332 | L2070 | 27 |
| 8332 | L8566 | 1 |
These matched pairs with 1 count might contain some suspicious activities. And we can assign the rows with over 5 count to be true pairs.
In the final predicted card pairs, only credit cards 1286 correspond to multiple loyalty cards (L3288, L3572), which can be found in Figure 1
card_correspond_count_others_ture <- card_correspond_count_others %>%
filter(count > 5)
# union the two true pairs table
card_correspond_true <- bind_rows(card_correspond_count_one2one,
card_correspond_count_others_ture)
To match credit card consumption and GPS data, we can assume that one car stop corresponds to one consumption if the consumption time falls within the car stop period at the same location.
But before that, we have to label GPS of car stops with specific locations.
All car stop locations are plotted on the map. And we can see there are many locations where the car stopped for over 6 hours (red dot on the map). Most of them are near the five parks (along the coast).
We are interested in car stops where credit card consumption happened, so we should exclude these stops which are very likely at home.
Besides, we notice that there are some long car stop near other locations. Those blue dots near “Ouzeri Elian” all belong to Isande, car 28. And he/she drives car very regular: stops at about 8:00 and leave at about 17:00.
It’s the same for car 9, Gustav. The car has many long stops near “Bean There Done That” (north-west area): stopped at about 17:00 and start moving on the second day at about 8:00. It seems that he lives here.
gps2_stop_long <- gps2_stop_sf %>%
filter(diff_mins >= 60*6)
gps2_stop_short <- gps2_stop_sf %>%
filter(diff_mins < 60*6)
map4 <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps2_stop_short) +
tm_dots(size = 0.1, alpha = 0.5) +
tm_shape(gps2_stop_long) +
tm_dots(col = "blue", size = 0.2, alpha = 0.3)
tmap_leaflet(map4)
Figure 2: Long and Short Car Stops
After excluding the long on short car stop, We can find that these car stops are still messy in Figure 2. It’s difficult to distinctly group car stops and label with locations.
Many car stops locations might not correspond to any local business locations. And some car stops, which actually correspond consumption in one location, don’t have close GPS location. Furthermore, some locations are very near each other.
To fix this issue, we can find the most confident correspondence firstly. For example, the multiple dots near the “Abila Airport”(southwest) should be a clear/distinct group. They are not single stops, close within one group and far from other dots&locations. Thus, We can believe that they correspond to the airport location with high confidence.
It’s the same for “Maximum Iron and Steel”(west), “Abila Scrapyard”(northwest), “Frank’s Fuel”(west), “Bean There Done That”(northwest), “Coffee Cameleon”(southeast), “Chostus Hotel”(northeast).
Some dots might not be a distinct group and there are several dot groups near one location. But some groups still can be labeled to one location with confidence, such as some dots along the street of the “Kronos Mart”(west). “Roberts and Sons”(west), “Desafio Golf Course”(northeast), “Albert’s Fine Clothing”(north), “Jack’s Magical Beans”(northeast), “Hallowed Grounds”(east) also have such dots groups.
After labeling these dots groups, we can match the credit card by the timestamp and location labels. It give us possible pairs of car id and credit card. For each pair, we will check whether every record has a unique corresponding a car stop record with this pair among 14 days. If all matches, this pairs will be regard as a confident pair.
# add index to get the distance from the distance matrix
gps2_stop_short$idx2 <- c(1:nrow(gps2_stop_short))
# calculate the distance between any two dot
distance_matrix <- st_distance(gps2_stop_short$geometry, gps2_stop_short$geometry)
# add a new col for labeling locations
gps2_stop_short$location <- ""
Take the dots group near “Frank’s Fuel”(west) as example:
We can hover in the map above to find one of dots close to the location and get the distinct ‘idx’ of this car stop.
Then we need to filter out all dots which belong to this group. To do this, we find all dots whose distances with it are less than 50 meters. And plot them on the map to check whether there are dots missing or the range of 50 meters is so large that it includes other others.
## "Frank's Fuel"(west),
# start with dot 2250, which is near this location
# find all dots which have less than 50 meter in distance
frank_idx <- which(as.integer(distance_matrix[2250,]) < 50)
dot_group <- gps2_stop_short %>%
filter(idx2 %in% frank_idx)
others <- gps2_stop_short %>%
filter(!idx2 %in% frank_idx)
# check whether there are other dots near this group but with different labels (blue dots) on the map
# if yes, we need to change the dot distance from default value(50) to a bigger value
mapx <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(dot_group) +
tm_dots(size = 0.1, alpha = 0.4) +
tm_shape(others) +
tm_dots(col = "blue", size = 0.2, alpha = 0.5)
tmap_leaflet(mapx)
The two dots close to the “Frank’s Fuel” are grouped into grey dots.
After that, we will label them with location “Frank’s Fuel” and match with consumption records in credit cards by location and timestamp(the consumption time need to be within the start time and the end time of the car stop). It will give us possible pairs of one car and one credit card.
# label them with this location
gps2_stop_short$location[gps2_stop_short$idx2 %in% frank_idx] <- "Frank's Fuel"
# match records in cc
gps2_stop_short %>%
filter(idx2 %in% frank_idx) %>%
left_join(cc, by = c('location')) %>%
filter(timestamp > start & timestamp < end) %>%
dplyr::select(id, start, end, timestamp, last4ccnum)
Simple feature collection with 2 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.84133 ymin: 36.07212 xmax: 24.84135 ymax: 36.07213
Geodetic CRS: WGS 84
# A tibble: 2 x 6
id start end timestamp
<dbl> <dttm> <dttm> <dttm>
1 15 2014-01-08 11:45:01 2014-01-08 12:33:01 2014-01-08 12:29:00
2 3 2014-01-18 18:07:01 2014-01-18 18:40:01 2014-01-18 18:39:00
# ... with 2 more variables: last4ccnum <dbl>, geometry <POINT [°]>
we find 2 possible pairs. Let’s check them separately.
For car id 15 and cc 3853:
# match all records of car id 15, cc 3853
match_cc <- subset(cc,last4ccnum == 3853) %>%
left_join(subset(gps2_stop_short, id == 15, select = c(start,end, day, idx2)),
by = c('day')) %>%
filter(timestamp > start & timestamp < end)
nrow(match_cc)
[1] 33
match_cc
# A tibble: 33 x 11
timestamp location price last4ccnum date day
<dttm> <chr> <dbl> <dbl> <date> <int>
1 2014-01-06 08:06:00 Brew've Been~ 18.2 3853 2014-01-06 6
2 2014-01-06 13:59:00 Katerina’s C~ 39.4 3853 2014-01-06 6
3 2014-01-07 12:00:00 Brewed Awake~ 64.8 3853 2014-01-07 7
4 2014-01-07 13:53:00 Hippokampos 51.7 3853 2014-01-07 7
5 2014-01-07 20:14:00 Guy's Gyros 62.3 3853 2014-01-07 7
6 2014-01-08 07:48:00 Brew've Been~ 32.9 3853 2014-01-08 8
7 2014-01-08 12:29:00 Frank's Fuel 65.3 3853 2014-01-08 8
8 2014-01-08 13:40:00 Guy's Gyros 77.9 3853 2014-01-08 8
9 2014-01-08 20:17:00 Frydos Autos~ 183. 3853 2014-01-08 8
10 2014-01-09 12:00:00 Jack's Magic~ 6 3853 2014-01-09 9
# ... with 23 more rows, and 5 more variables: hour <int>,
# start <dttm>, end <dttm>, idx2 <int>, geometry <POINT [°]>
33 credit card records match the car stop period of car 15. Check whether there are cc records which match multiple pairs.
match_cc %>%
group_by(idx2) %>%
summarize(count = n()) %>%
filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx2 <int>, count <int>
Check whether there are any cc records which are not matched.
# A tibble: 0 x 7
# ... with 7 variables: timestamp <dttm>, location <chr>,
# price <dbl>, last4ccnum <dbl>, date <date>, day <int>, hour <int>
# All match, so label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx2 %in% match_cc$idx2] <- match_cc$location
All records match, we think the credit card 3853 belongs to the owner of the car 15.
For car id 3, cc 9635, the steps are the same.
# match all records of car id 3, cc 9635
match_cc <- subset(cc,last4ccnum == 9635) %>%
left_join(subset(gps2_stop_short, id == 3, select = c(start,end,idx2, day)),
by = c('day')) %>%
filter(timestamp > start & timestamp < end)
# check whether there are cc records which match multiple car stops
match_cc %>%
group_by(idx2) %>%
summarize(count = n()) %>%
filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx2 <int>, count <int>
# check whether there are any cc records which are not matched
subset(cc,last4ccnum == 9635) %>%
setdiff(match_cc[1:7])
# A tibble: 2 x 7
timestamp location price last4ccnum date day hour
<dttm> <chr> <dbl> <dbl> <date> <int> <int>
1 2014-01-14 12:00:00 Bean Th~ 19.4 9635 2014-01-14 14 12
2 2014-01-15 12:00:00 Bean Th~ 5.34 9635 2014-01-15 15 12
# 2 out of 26 not match, but their locations are "Bean There Done That", it's still confident pair
# label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx2 %in% match_cc$idx2] <- match_cc$location
The credit card 9635 belongs to the owner of the car 3.
Other dot groups will go through the same process to find pairs of cars and credits.
### confident pair
### cars&consumption with pairs
### match new pairs
###
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships
1
[1] 1
1
[1] 1
1
[1] 1
Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.
CEO周末在这住了两天
car33和car7都会在Chostus Hotel几乎同时到达
## "Chostus Hotel"(northeast).
# start with dot
1
[1] 1
1
[1] 1
trucks which were used for non-business issue
day*hour Now, let’s divide the units from days into hours:
# cc_freq_day_hour <- as.data.frame(xtabs(~location++day+hour, data = cc))
# cc_freq_day_hour$hour <- as.numeric(levels(cc_freq_day_hour$hour))[cc_freq_day_hour$hour]
# p3 <- ggplot(cc_freq_day_hour,aes(x=hour,y=location))+
# geom_tile(aes(fill=Freq),color="white")+
# scale_fill_gradient(low = "#EFF7FB", high = "#0D2330")+
# theme(panel.background = element_blank(),
# axis.ticks = element_blank(),
# axis.title = element_blank(),
# legend.title=element_blank(),
# plot.title = element_text(hjust=0.5))+
# facet_wrap(~ day, ncol = 7)+
# labs(title = "CC Frequency by hour of the day")
# ggplotly(p3)
# knitr::kable(card_correspong_count,
# caption = ) %>%
# kableExtra::kable_paper("hover", full_width = F)
1
[1] 1